Cytochrome P450 Enzyme Design by Constraining the Catalytic Pocket in a Diffusion Model

Although cytochrome P450 enzymes are the most versatile biocatalysts in nature, there is insufficient comprehension of the molecular mechanism underlying their functional innovation process. Here, by combining ancestral sequence reconstruction, reverse mutation assay, and progressive forward accumulation, we identified 5 founder residues in the catalytic pocket of flavone 6-hydroxylase (F6H) and proposed a “3-point fixation” model to elucidate the functional innovation mechanisms of P450s in nature. According to this design principle of catalytic pocket, we further developed a de novo diffusion model (P450Diffusion) to generate artificial P450s. Ultimately, among the 17 non-natural P450s we generated, 10 designs exhibited significant F6H activity and 6 exhibited a 1.3- to 3.5-fold increase in catalytic capacity compared to the natural CYP706X1. This work not only explores the design principle of catalytic pockets of P450s, but also provides an insight into the artificial design of P450 enzymes with desired functions.


Introduction
Cytochrome P450 enzymes (P450s) are ubiquitous in nearly all living organisms, playing pivotal roles in various meta bolic processes and pathways crucial for life, growth, and development [1].As the most versatile biocatalysts in nature, P450s not only catalyze more than 95% of the reported oxi dation and reduction reactions [2][3][4], but also are known as a "Universal catalyst" in industrial applications due to the abil ity of selective oxidation of inert carbon-hydrogen bonds under mild conditions [5,6].Therefore, obtaining new P450s with better properties has become an important goal in the field of bioengineering [7,8].In spite of huge functional diversity, most P450s share the same catalytic mechanism [9,10] and similar structural scaffolds [4].However, the catalytic pockets exhib ited significant variability in P450s with different functions (Fig. S1) [11].Moreover, the nonpolar composition and unique conformational flexibility of the substrate binding pockets are likely to enhance the capacity of these enzymes to modify their active sites and adapt to new substrates and selectivity [4].Considering the high evolvability of P450s, directed evolution has been extensively employed in engineering P450s with bet ter traits [12][13][14][15].However, this method often necessitates multiple rounds of random mutagenesis and highthroughput screening, making it challenging to exhaustively explore the potential protein space, whether in the laboratory or compu tationally [16].
The rapid development of deep learning has opened up a new method to acquire novel P450s with desired characteris tics.Even though impressive achievements have been witnessed in protein structure prediction [17,18], the desired functional design is still a big challenge [19,20].Recent developments in protein design leveraged by deep learning methods encompass a broad spectrum.These include designing sequences for fixed backbones [21], variable backbone design [22], and the direct generation of novel sequences and backbones within the natu ral protein space [23].These models employ various architec tures, including convolutional neural networks (CNNs), graph neural networks (GNNs), and Transformers, which are all instrumental in capturing the complex interactions between amino acids within a protein sequence [19].The abundance of sequence and structure data contributes to these deep learning models surpassing the performance of traditional physical or statistical models [24,25].However, when considering func tional design, it is impossible to collect sufficient highquality functional data to train a sophisticated model to create sequences with a desired function [26,27].Considering the current short age, an approach that fuses knowledgebased techniques to scrutinize the design principles of natural P450s with powerful deep learning models to expand the natural protein sequence space, may be appropriate for designing new P450s.As our comprehension of the fundamental mechanisms that govern the evolution of the catalytic pocket for functional innovation in natural P450s remains limited, elucidating the process by which a particular P450 adopts a new function becomes crucial in designing a new one.
In this work, we used a flavone 6hydroxylase (CYP706X1) from Erigeron breviscapus as an example, which belongs to the CYP706X subfamily and converts apigenin into scutellarein in the biosynthetic pathway of scutellarein (Fig. S2) [28].Firstly, we determined the founder residues constituting the catalytic pocket responsible for the functional innovation of the P450 gene through ancestral sequence reconstruction, reverse muta tion assay, progressive forward accumulation, and crystallo graphic analysis.Then, we elucidated the design principle of catalytic pocket for the functional innovation by an indepth structural analysis.Finally, we devised the P450Diffusion, an artificial P450 generative model, by integrating the catalytic pocket design principle with a denoising diffusion probabilistic model that has demonstrated outstanding performance in image generation [29].With the P450Diffusion model, we suc cessfully designed 10 artificial P450s with F6H activity, and one design outperforms the naturally bestperforming gene about 3.5fold, indicating the potential of P450Diffusion in the design of new P450 enzymes.

Functional innovation of F6H in CYP706 family
Among the characterized P450s in CYP706 family, only the P450s in CYP706X subfamily could catalyze the flavonoid sub strates, indicating that the F6H function may be de novo inno vated in the ancestor of the CYP706X subfamily (Fig. 1A and Fig. S3).Moreover, we found that the catalytic pocket's con figuration of CYP706X1 (i.e., EbF6H from E. breviscapus) is totally different from other P450s in the CYP706 family.The substrate apigenin could not even be properly positioned in other P450s with a C6prone reactive state, which refers to the molecular configuration that is best suited for binding to the catalytic pocket of the enzyme and undergoing a reaction (Fig. S4).Therefore, it provides us an opportunity to decipher the constructive mechanisms for the formation of F6H's catalytic pocket by comparing the neighboring genes in the CYP706 family.
We compared the evolutionary trajectory between the CYP706X subfamily and the most closely nonfunctional CYP706Y subfamily using ancestral sequence reconstruction (see the Phylogenetic analysis and ancestral sequence recon struction section).By testing the function of the inferred ances tral P450s for all key nodes in the phylogenetic tree (Fig. 1B), most ancestral subnodes in the CYP706X subfamily displayed significant F6H activity (Fig. 1C and Fig. S5).Conversely, the F6H function disappeared in both the common ancestor of CYP706X and CYP706Y subfamilies (ancXY) and the ancestor of CYP706Y subfamily (ancY) (Fig. 1C).Thus, the F6H's cata lytic pocket should be originated when the CYP706X subfamily diverged from the common ancestor of CYP706X and CYP706Y.To gain insight into the evolution of the catalytic pocket under lying functional innovation, we determined the crystal structure of ancX3, which was found to crystallize more readily after screening for crystallization conditions (Figs. S5 and S6 and Table S1).Indeed, the binding mode of apigenin in the common ancestor of the CYP706X subfamily (ancX) was obviously dif ferent from the nonfunctional ancXY, though they possessed very similar structural arrangement (RMSD < 1.0 Å, sequence identity = 83%) (Fig. 1D and E).In ancX, the substrate apigenin not only forms a strong π-π stacking interaction with residue Phe, but also forms an obvious hydrogen bond with the Oxo group of CpdI, which stabilized the substrate in a C6prone reactive state in ancX's catalytic pocket (Fig. 1D).However, in ancXY, the substrate apigenin could not form interactions like ancX with the residues in catalytic pocket and CpdI species, which means a nonC6prone reactive state in the catalytic pocket of ancXY, which takes away the possibility of ancXY being catalytic on apigenin (Fig. 1E).

Founder residues for functional innovation of F6H
In order to clarify the molecular mechanism of forming the catalytic pocket with F6H function, we proposed to analyze the changes of residue compositions between catalytic pockets of nonfunctional ancXY and functional ancX.Within 8 Å range of the active center, 16 out of 48 residues are different (Fig. 2A).Interestingly, when we replaced all of the 16 residues with the corresponding residues in ancX, the mutant (referred to as the ancXY16) obtained F6H function (Fig. 2B).Given that not all residues in the catalytic pocket contributed significantly to sub strate recognition and binding due to different locations of residues in 3dimensional space [30], we attempt to find out the founder residues in the catalytic pocket by the reverse muta tion assay (RMA: evaluating the mutational effect of each resi due within the ancXY16's catalytic pocket by reverting it to ancestral type).
Firstly, RMA was conducted on the 16 residues of ancXY16.We found that one of these mutations (A220L) inactivated the ancXY16, and 12 mutations significantly decreased the cata lytic activity, but 4 mutations (i.e., G111A, N119Q, F251L, and V307L) had less impact on the activity (Fig. 2B).Structural analysis showed that these 4 mutations were distant from the P450 catalytic center and were not involved in the changes in the residue's intrinsic hydrophilicity/hydrophobicity (see the Structural modeling and analysis section, Fig. S7).We excluded these 4 mutations to construct the ancXY12, which still shows F6H function.Furthermore, RMA against the 12 residues of ancXY12 revealed that 2 reverse mutations (i.e., A220L and T114I) would deactivate the enzyme, while the remaining 10 mutations only weakened enzyme activity, indicating that the actual number of founder residues is fewer than 12.
In order to identify founder residues more quickly, we used a progressive forward accumulation (PFA) strategy that progres sively added important mutations to ancXY until the mutant gains F6H function (Fig. 2C).Initially, we gradually added single mutation to ancXY according to the order of the RMA muta tional effect in ancXY12.Until 5 mutations were accumulated, the constructed ancXY5 (i.e., L220A/I114T/T317A/W123F/ L248M) displayed F6H activity.RMA analysis against ancXY5 showed that each of the 5 reverse mutations deactivated the enzyme, which verified the necessity of the 5 mutations in ancXY5.To further confirm the crucial roles of these 5 residues for F6H's functional innovation, we perform a second PFA according to the order of the RMA mutational effect in ancXY 16 (Fig. S8).The order of the RMA mutational effect in 2 rounds of RMA (ancXY16 and ancXY12) is very similar; only a minor discrepancy lies in S115T, which ranked fourth in the first round, dropped to sixth in the second round.Until 6 mutations are accumulated in the second PFA, the constructed ancXY6 (i.e., L220A/T317A/1114T/T115S/L248M/W123F) displayed F6H activity.RMA analysis of ancXY6 revealed only 5 reverse mutations (A220L, T114I, A317T, F123W, and M248L) deacti vated the enzyme, indicating that T115S mutation is not critical for the emergence of F6H function.
Therefore, these results above showed that the mutations of the 5 amino acids (L220A/I114T/T317A/W123F/L248M) play a founder role (referred to as founder residues in the following) in the F6H functional innovation process from ancXY to ancX.As to the other 11 residues, the structural analysis showed that these mutations decreasing the catalytic activity might play auxiliary roles in the enzyme catalysis due to no direct interac tions with the substrate apigenin (Figs.S7 and S9).

The principle of catalytic pocket for functional innovation of F6H
We further interpreted the underlying mechanism of 5 founder residues for functional innovation through an indepth analysis of the apigeninbinding model in ancXY5 (Fig. 3A).The 5 founder residues could be divided into 2 parts according to their roles in protein structure.The first part included I114T, W123F, and L248M, which mainly contributed to fix or bind the apigenin.For example, the I114T introduced a hydrogen bond with 7′ hydroxyl of apigenin with an energy contribution of 0.66 ± 0.10 kcal/mol (see the MD simulations and MMPB/ GBSA section, Fig. 3B).A null mutation of T114V in ancXY5 also ascertained the indispensability of this hydrogen bond for the F6H function (Fig. S10).The W123F contributed to the apigenin binding (−3.14±0.37 kcal/mol) with an aromatic π-π stacking interaction to the phenyl ring of the apigenin and alle viated the spatial conflicts caused by ancestral tryptophan in the ancXY (Fig. 3C).The L248M, located in the substrate access gate, was not only involved in the substrate tunneling process (Fig. 3D), but also contributed to the apigenin binding with a π stacking to the phenyl ring of apigenin.The second part included L220A, and T317A contributed to alleviate inappro priate interactions and space conflicts.The L220A alleviated the space conflict conducted by ancestral leucine and provided sufficient space for the placement of the B ring of substrate apigenin through the introduction of a small side chain (Fig. 3E).The T317A not only provided sufficient space for the placement of the A ring of apigenin but also avoided the wrong orientation apigeninbinding mode shown in nonfunctional ancXY caused by a hydrogen bond between the hydroxyl group of threonine and the substrate (Fig. 3F).
Based on the mutations of 5 founder residues, it appears that, with an appropriate spatial capacity (provided by small side chain residues A220 and A317), the catalytic pocket evolved following a "3point fixation" model.The "3point fixa tion" refers to essential interactions with 3 pivots in apigenin including the following: 4′OH of apigenin molecule (the first pivot) was fixed by the hydrogen bond from T114, the "B" ring of apigenin (the second pivot) was fixed by the π stacking inter actions from F123 and M248, and 7OH of apigenin (the third pivot) was fixed by the hydrogen bond with CpdI ironoxo moiety (Fig. S11).The model held the substrate apigenin in a reactive nearattack conformation (NAC), which maintained the relative orientation between the reaction site of apigenin and CpdI ironoxo moiety at a favorable distance and angle (3.6 Å and 155°), thus serving to initiate the 6hydroxylation reaction of apigenin in the catalytic process (Fig. S12).We pro pose that the "3point fixation" model could serve as the design principle for the catalytic pocket responsible for the natural functional innovation of F6H, which also offers us the potential to de novo design P450s with the desired functions.

Diffusion model-based designing of P450 with the specific function
Hundreds of thousands of P450 protein sequences collected in public databases offer us an opportunity to learn natural P450 sequence diversity and design new functional P450s [31].Recent advancements in diffusion models have shown signifi cant potential in enhancing the design of P450 enzymes with specific functions [29,32].Here, we proposed a P450 sequences diffusion model (P450Diffusion) to de novo design P450s with a desired function by combining the diffusion model with the design principle of F6H catalytic pocket (Fig. 4A).P450Diffusion Fig. 2. Reverse mutation assay and progressive forward accumulation for the identification of founder residues.(A) Sixteen different residues within the 8-Å range of active center of non-functional ancXY and functional ancX.All residues were represented as ball-and-stick model, and the residues of ancX and ancXY were colored cyan and magenta, respectively.(B) Reverse mutation assay (RMA) process for elucidating the impact of pocket residues on catalytic activity.Two rounds of RMA were conducted on ancXY-16 (light blue) and ancXY-12 (pink), respectively.(C) Progressive forward accumulation (PFA) process for accumulating crucial residues in F6H functional innovation.PFA process followed the order of RMA mutation effects in ancXY-12.The mutant ancXY-5 (rose) within the PFA process displayed initial indication of F6H functional innovation.RMA analysis against ancXY-5 showed that each of the 5 reverse mutations deactivated the enzyme (green).mainly consists of 2 models (i.e., pretrained and finetuning diffusion models).Firstly, 226,509 natural P450 sequences were collected to train a pretrained P450 sequence diffusion model.This pretrained model consists of 2 subprocesses: a forward diffusion subprocess, which gradually adds Gaussian noise to the representation of P450 sequence until it becomes random noise, and a reverse generation subprocess, which starts from random noise and gradually denoises the representation of P450 sequence to generate a new P450 sequence.After 150,547 training rounds, the pretrained diffusion model could gener ate a wide variety of sequences, with similarities to natural sequences ranging from 20% to 50%.Secondly, 19,202 P450 sequences with appreciable similarity to CYP706X subfamily were used to finetune the pretrained diffusion model for ensuring that the generated sequences have a similar structural backbone to the F6H.Besides, the 5 founder residues including T114, F123, A220, M248, and A317 were constrained to ensure the reproduction of the "3point fixation" design principle in de novo generated sequences.The model integrating training set finetuning with constrained generation was referred to as the finetuning diffusion model.
Furthermore, we used the finetuning diffusion model to generate a total of 60,000 nonnatural P450 sequences, which share about 50% average amino acid identity to that of the natural sequences.In comparison with natural P450s, the gen erated sequences not only have a highly similar distribution of Shannon entropies for each position in multiple sequence align ments, but also display very consistent residue-residue co evolution patterns and physicochemical properties (Figs.S13  and S14).However, the generated sequences can be grouped into smaller clusters and interpolated between the natural sequence clusters, indicating that the generated sequences have higher diversity than natural P450s (Fig. 4B).It is noteworthy that the sequences generated by the finetuning P450Diffusion model form a larger cluster, exhibiting greater similarity to the CYP706X subfamily, thereby demonstrating the effectiveness of the finetuning model.Besides, we compared the distribution of 5 founder residues among natural and generated P450s (Fig. 4C).It is found that except the threonine (T) in position 317, other positions are highly variable in natural and generated P450s from pretrained model, even in natural P450s from CYP706 family.However, all 5 founder residues are relatively conserved in the generated P450s from the finetuning model, indicating that the P450Diffusion possessed the capability of generating sequences with an amino acid distribution similar to that of natural F6H on the basis of constrained 5 founder residues.

Experimental verification and structural insights of de novo generated P450s
Finally, we experimentally tested whether the generated sequences from P450Diffusion were true P450 enzymes, and performed F6H function.In order to accurately obtain func tional sequences from numerous designs, we conducted virtual screening on 60,000 generated sequences based on 3 specific criteria: the computational scores of composite metrics for assessing the quality of generated sequences, the 3dimensional pocket constraints of the 5 founder residues, and the robust ness of the apigenin binding modes (see the "Computational evaluation and structurebased virtual screening for generated sequences" section, Fig. 4A).Following virtual screening, 17 promising designs were meticulously selected for further exploration.These designs were subsequently synthesized and expressed within yeast expression systems.Notably, these designs exhibited sequence identities ranging from 70% to 87% when compared to CYP706X1, underlining their potential as novel catalysts (Table S2).The recombinant yeasts were culti vated for 4 days by feeding apigenin as substrate and HPLC analysis revealed 10 designs with significant F6H activity (Fig. 5A).Surprisingly, there are 6 designs that exhibited a 1.3 to 3.5fold increase in scutellarein production compared to CYP706X1 (Fig. 5B).The 4 remaining active designs also dis played comparable activities with other natural F6H enzymes (i.e., Cnan706X and Lsal706X).Therefore, the results indicated that the P450Diffusion not only could capture the fundamental design principle of F6H catalytic pocket and effectively generate P450s sequences with F6H activity, but also selected out the better P450 enzymes compared to natural sequences from the P450 sequence space.
Furthermore, we presented a structural perspective on the active designs as well as the distinctions between the active and inactive ones.The structural and sequence analysis reveals that nearly all mutations in the generated designs are distant from the substrate binding pocket (Fig. 5C, exemplified by non functional Design91808 and functional Design4129, and Fig. S15).Additionally, the binding orientations of active designs closely resemble those of natural CYP706X1 (Fig. S16).However, molecular dynamics (MD) simulations have demonstrated significantly weaker binding stability of the sub strate apigenin in the inactive designs when compared to the active ones (Fig. 5D).This discrepancy likely serves as the pri mary reason for the inactivity observed in these 7 designs.Besides, we observed that the overall protein structures of the active designs appear to exhibit greater stability than the inac tive ones following extensive MD simulations (Fig. 5E).Notably, significant structural fluctuations are observed, particularly within the sequence ranges of 220 to 230 and 390 to 410, as illustrated in the inactive designs (Fig. S17).For instance, in Design33380, the R229K mutation disrupts the salt bridge with E251, while the S230P mutation causes a break in the alpha helix structure (Fig. S18).In Design91808, the S407L muta tion breaks the hydrogen bond with the backbone of A51, resulting in a less stable protein backbone than observed in active designs (Fig. S19).These results imply that the amino acid mutations on the surface of the protein could lead to a reduction in the global stability of the protein, which further leads to incorrect folding and substrate binding instability, and ultimately to the loss of activity of the designs.
In order to further analyze the other 7 designs without F6H activity, we tested the P450 expression in yeast expression sys tems by integrating green fluorescent protein (GFP) at the Cterminal.All recombinant proteins successfully showed green fluorescence (Fig. S20).Considering that the GFP is a very stable protein whose incorrect folding of N terminal protein would not affect the brightening of GFP and the GFP itself could also enhance the folding and dissolving, we next design a protein soluble expression experiment.Since the pro tein expression level in the yeast system is low, it is very difficult to isolate and purify when expressed in yeast cells.In this study, the E. coli system, which could produce more proteins, was used to express these designed P450s.We confirmed the expres sion of P450s through Nterminal truncation, protein purifica tion, SDSPAGE gel analysis, and spectral analysis of the P450 enzyme generated in this study (Supplementary Information, Figs.S21 and S22).Results indicate that all proteins were expressed; however, those with higher catalytic activity showed better soluble expression in the E. coli system.This analysis provided us with valuable insights for future improvements of the P450 generative model.

Discussion
Nature has evolved an amazing array of enzymes to catalyze biological functions and enabled living systems to face diverse environmental challenges [33].Gene duplication contributes most to the generation of new enzymes [34], especially for cyto chrome P450s, which evolve to the largest enzyme family for plant metabolism by widespread wholegenome and tandem duplications [7,35,36].Although most duplicates are lost or subfunctionalized by purifying or neutral selection, neofunc tionalization often happened in P450 evolution due to high plasticity and variability of catalytic pockets [37,38].The evo lutionary trajectory of P450's functional innovation have attracted researchers' attention for a long time [39][40][41] and the previous researches were mainly focused at the gene level [42][43][44][45][46] or residue level [47,48].In this study, based on ances tral sequence reconstruction, RMA and PFA, we identified 5 founder residues in the catalytic pocket of flavone 6hydroxylase (F6H).However, considering that there are several muta tions left that also affect F6H function, we speculated that these 5 residues may represent the most probable combination for F6H functionality, and there might be other residue com binations with similar effects.
Upon conducting additional structural analysis of the api geninbinding model in ancXY5, we have formulated a "3point fixation" model to elucidate the design principle of the catalytic pocket that played a pivotal role in the functional innovations of F6H function.The "3point fixation" model seems to be a general principle for the substrate binding in P450's catalytic pocket, such as the camphor binding in P450cam [49] and Npalmitoyl glycine binding in P450BM3 [50].Similar fixation rules could also be found in the general enzymatic catalysis where the substrates or catalytic residues are held in the catalytic pockets [51], even as a term commonly used in medicine and architecture [52].It is worth mentioning that besides the "3point fixation" model, nature also evolved other catalytic pocket design principles for functional innovations in P450s.For example, the SbaiCYP82D4, as the isoenzyme of CYP706X1, has evolved to a completely different catalytic pocket configura tion for flavone 6hydroxylation [53,54].The catalytic pocket of SbaiCYP82D4 consisted of more residues with strong hydro phobicity, and no obvious hydrogen bond was found between surrounding residues and substrate apigenin, making the sub strate bind in an "oblique binding" orientation (Fig. S23), which is distinguished with the "vertical binding" orientation in CYP706X1.Although a different substrate binding model was found in SbaiCYP82D4, the substrate apigenin also formed a reactive conformation in a NAC model to enable the initiation of the catalytic reaction.This fact indicated that substrates in P450s could be held in favorable orientations with different fix ing rules under the premise of sufficient space and suitable shape for the placement of the substrate.
P450 enzymes, as a type of membrane protein, are extremely difficult to express and purify in large quantities, which also increases the difficulty of designs and modifications for P450 enzymes.Under conventional conditions, yeast, as a eukaryotic microorganism, provides suitable conditions for the correct fold ing of membrane proteins like P450s due to its endomembrane system.However, due to the need to extract peroxisomes and the low expression levels of proteins, there are few cases of puri fying P450 proteins from yeast systems.Here, we attempted the designed P450s folding tests in both yeast and E. coli systems.Preliminary results showed that all proteins were expressed, but in the E. coli system, proteins with higher catalytic activity exhib ited better soluble expression.Currently, the purification of membrane proteins such as P450s does indeed face significant challenges, and it is believed that with the advancement of pro tein purification technologies, further indepth studies on pro tein expression will be realized.Techniques for increasing solubility and ensuring correct folding are also promising direc tions for future research in P450 protein designs.
The rapid development of deep learning has witnessed many impressive achievements in protein structure design, while the desired functional design is still a big challenge [55][56][57].Our research provides a novel strategy for the de novo design of P450s with specific function by coupling the design principle of catalytic packet with deep learning model.In this study, non natural P450s with F6H function were successfully designed by integrating the "3point fixation" model with a denoising diffusion probabilistic model.The structural analysis of active designs suggested that the design principle of F6H catalytic pocket has been fully incorporated into the deep learning model.Furthermore, the structural insights between active and inactive designs suggest that mutations on protein surface may be the fundamental factors contributing to the inactivity or reduced activity of designed sequences, providing us with valu able insights for future improvements of the P450 generative model.More structure or sequencebased features should be considered, like the substratetunneling feature, the overall stability of protein, and so on.
In general, the current work provides insights into the prin ciple of pocket design in the P450 functional innovations and offers a potential research paradigm for the de novo design of P450 enzymes with desired functions.With the increase of in depth investigated P450s, more catalytic pocket design prin ciples would be deciphered and facilitate the design of P450s with novel and desired functions.

Phylogenetic analysis and ancestral sequence reconstruction
The P450 sequences of CYP706 subfamilies were selected from the previous study [28], including 10 P450s of CYP706X/Y subfamilies for ancestral sequence reconstruction, and a P450 of CYP706W subfamily as an outgroup.The transmembrane domains of P450 sequences were annotated with the TMHMM package [58].Using the crystal structures of CYP76AH1 (PDB ID: 5YLW), a structural informationbased sequence align ment of the P450s deprived of Ntransmembrane region was generated by Expresso [59].Poorly aligned regions (N and Ctermini) were trimmed.Then, a phylogenetic ML tree was created with the RAxML [60].All protein sequences of ances tral nodes were deduced using FastML [61,62].The N and Cterminal amino acids including transmembrane domain derived from CYP706X1 were added to each ancestor.Ultimately, we obtained the most probable ancestor of CYP706Y subfamily (ancY) and CYP706X subfamily (ancX), the common ancestor of 2 subfamilies (ancXY), and all subancestors of CYP706Y subfamily (ancY1, ancY2, and ancY3) and CYP706X subfamily (ancX1, ancX2, and ancX3) in the subnodes of the phyloge netic tree (Fig. 1B).The ancestral sequences are available in the Supplementary information.

Crystallization and structure solution
Initial crystallization screening was performed using the sittingdrop vapordiffusion method with commercial crys tal screen kits at 16 °C.The ancX3 protein at a concentration of 10 mg/ml in buffer (2 mM KH 2 PO 4 , 8 mM K 2 HPO 4 , 500 mM NaCl, 0.2 mM EDTA, 1 mM DTT, and 10% [v/v] glycerol, pH 7.4) was used in the initial crystallization screening to determine the crystallization condition.The ancX3 protein was mixed with precipitant solution at a drop size of 0.6+0.6 μl against the reservoir containing 50 μl of precipitant solu tion.The crystals grew from the mixture with the precipitant solution consisting of 1.34 M NaCl, 13.4% (w/v) PEG3350, 0.1 M MgCl 2 , and 0.1 M imidazole, pH 6.5.The crystal opti mization was performed using the hangingdrop vapor diffusion method at 16 °C against the reservoir containing 0.5 ml of the precipitant solution.The drops contained 2 μl of precipitant solution, 2 μl of ancX3 protein, and 0.2 μl of additive solution (40% v/v polypropylene glycol P400) from the Hampton additive screen kit.
Crystals of ancX3 were mounted from the crystallization drops in nylon loops and flashfrozen in liquid nitrogen using the cryoprotectant consisting of 1.34 M NaCl, 13.4% (w/v) PEG3350, 0.1 M MgCl 2 , 0.1 M imidazole, and 25% (v/v) glyc erol, pH 6.5.Diffraction data (λ = 0.97918 Å) were collected on beamlines 17U1 at Shanghai Synchrotron Radiation Facility for IFS crystals.Diffraction images were indexed, integrated, and scaled using the XDS program.Details of the data collec tion statistics are summarized in Table S1.
The structure of ancX3 was solved by molecular replacement with the structure of CYP76AH1 (PDB code: 5YLW) as search model [63].Iterative model building and refinement were per formed using COOT and PHENIX, respectively.Coordinates and structure factors have been deposited with the PDB under accession id 8JC2.

Structural modeling and analysis
Structure modeling: The 3D models of ancestral proteins and generated designs are predicted by the local ColabFold algo rithm through inputting the crystal structure of ancX3 as one of the templates [64].
Molecular docking: The Cartesian coordinates and atom charges of CpdI were obtained from published data [65].The structure of substrate apigenin was obtained from PubChem [66], and assigned with AM1BCC charges [67].An ensemble of different conformations of the substrate were generated by enumerating these under OpenBabel [68].Substrate rotamers were extensively sampled around the C2-C1′ axis with 5° inter vals.The mol2formatted CpdI and apigenin were parameterized with molfile_to_params.pyscript.Before molecular docking, the protein structure complex with CpdI species was firstly sampled and minimized by the RosettaRelax protocol without constraints [69,70].Then, the apigenin was docked into relaxed structures using RosettaLigand [71][72][73].Distance restraints were added between the Fe ion and ligated cysteine (2.3 Å ± 0.1 Å), between carboxylate groups of heme and arginines (2.2 Å ± 0.4 Å) in RosettaScripts [74].Each run of 100,000 models was generated with the MPI [75] version of RosettaLigand, and the top 100 models with the lowest Rosetta energy unit were clustered with Calibur [76], and the structures with the lowest binding free energy (interface_delta) were selected as our final docking mod els.The Rosetta scripts and option files for RosettaLigand are available in the Supplementary Information.
Pocket analysis: The hydrophilicity calculation and hydro phobic surface area evaluation of CYP pockets are implemented with fpocket software [77].The dpocket function of fpocket2 was used to extract binding pocket descriptors for different P450s.

MD simulations and MM-PB/GBSA
The ancestral proteins underwent 100ns simulations using MD to optimize their complex structures.Additionally, the complex structures of the generated designs underwent a 300ns MD simulation for further structural optimization and to observe overall protein and substrate fluctuations within the catalytic pockets.All simulations were performed in triplicate.
System setup: Our target models with CpdI and substrate molecules were set as the initial structures for MD simulation.The protein structures were prepared with the pdb4amber application in the Amber20 package [78].The force field for the CpdI species was taken from published data [65].The par tial atomic charges and missing parameters for substrate api genin were generated by Antechamber with an AM1BCC charge model [79,80].Na + ions (0.15 M) were added to the protein surface to neutralize the total charge of the system.Finally, the resulting system was solvated in a rectangular box of TIP3P waters extending up to a minimum cutoff of 12 Å from the protein boundary.The Amber ff14SB force field was employed for all the proteins in MD simulations.
MD simulations: After proper parameterizations and setup, the resulting systems were minimized with 2 steps (the first step with 5,000 steps of steepest descent and 10,000 steps of conjugate gradient, and the second step with 10,000 steps of steepest descent and 30,000 steps of conjugate gradient) to remove the poor contacts and relax the systems.The systems were then gently annealed from 0 to 300 K under the NVT ensemble for 50 ps with a restraint of 5 kcal mol −1 Å −2 .Subsequently, the systems were maintained for a total of 5 rounds of density equilibration of 20 ps in the NPT ensemble at a target temperature of 300 K and a target pressure of 1.0 atm using the Langevin thermostat [81] with a restraint of 1 kcal mol −1 Å −2 .A total of 5 rounds of density equilibration relaxed the system to achieve a uniform density after heating dynamics under periodic boundary conditions.Thereafter, we removed all of the restraints applied during heating and density dynamics and further equilibrated the systems for ∼2 ns to get a wellsettled pressure and temperature for conformational and chemical analyses.This was followed by an MD production run for each of the systems.During all of the MD simulations, the covalent bonds containing hydrogen were constrained using SHAKE [82] and particlemesh Ewald [83] was used to treat longrange electrostatic interactions.All of the MD simu lations were performed with the GPU version of the Amber 20 package.
MMPB/GBSA: The python script mmpbsa.py[84] in Amber20 package was used in this research to analyze the bind ing free energy of apigenin.According to the systematic research of Hou et al., the inclusion of the conformational entropy may be crucial for the prediction of absolute binding free energies but not for ranking the binding affinities of similar ligands [85].The binding free energy analysis is implemented here just for analyzing the interaction energy contribution of each key resi due.Therefore, the change of conformational entropy upon ligand binding has been ignored in our calculation because of expensive computational cost and low prediction accuracy.The calculation procedure mainly referred to the MMPBSA protocol in AMBER tutorial websites (https://ambermd.org/tutorials/advanced/tutorial3/index.php).

Building and training the P450Diffusion
Denoising diffusion probability models (or diffusion models, for short) work by applying a Markov process to corrupt the training data by successively adding Gaussian noise, then learn ing to recover the data by reversing this denoising process [86].We adapt this framework to generate protein sequences, intro ducing necessary modifications to encode the discrete protein sequences into a vector of a specific length.We used physico chemical characterbased schemes, the principal components score Vectors of Hydrophobic, Steric, and Electronic properties (VHSE8, [87]), to encode protein sequences.The P450Diffusion is composed of a UNet with selfattention layers and features a classical Ushaped structure with downsampling and up sampling blocks.
To build the P450Diffusion, we screened and analyzed all potential P450s from a published P450 database [31] and public databases, filtering out sequences with a length greater than 560 and resulting in 226,509 sequences to form the training dataset.Then, we encode the training dataset, where each amino acid in the protein sequence is encoded as an 8dimensional vector, and each batch protein sequence is encoded as a 64×1×560×8 vector.Here, 64 is the batch size equal to the number of samples in the training data; 1 represents the chan nel size; 560 represents the maximum length of the protein sequence; and 8 represents the VHSE8 encode vector for each amino acid in the protein sequence.If the protein sequence is shorter than 560, we add gaps until it reaches a length of 560.In this case, we assign a vector of 8 zeroes as the encoding for gaps.Then, we started to train the pretrained P450 sequence diffusion model.After 150,547 training steps, the loss functions of the pretrained diffusion model converged and the model was obtained (Fig. S24A).
In order to generate sequences with the F6H function more effectively, we finetune the pretrained diffusion model with the filtered dataset by selecting sequences with more than 30% amino acid identity to the CYP706X subfamily and clustering them with 90% sequence similarity.Finally, a total of 19,234 sequences formed a finetuning dataset.Meanwhile, we assigned different sample weights to 30 sequences from the CYP706X subfamily and other sequences in the finetuning dataset.The sampling weight ratio between the 30 sequences from the CYP706X subfamily and other sequences was 600:1.The P450Diffusion was obtained after 150,500 training steps (Fig. S24B).
The P450Diffusion architecture to generate P450 sequences was based on the diffusion model.The diffusion model is com posed of a UNet with selfattention layers.The main difference with traditional UNet is that the upsampling and down sampling blocks support an extra timestep argument on their forward pass.This is done by embedding the timestep linearly into the convolutions.In the training process, the network takes a batch of noisy protein sequences of shape (batch size, chan nels, height, and width) and a batch of noise levels of shape (batch size, 1) as input, and returns a tensor of shape (batch size, channels, height, and width).In this model, we used a mean squared error loss (MSELoss) function and optimized the networks with the AdamW algorithm, setting the learning rate to 2e4.Our model was implemented in PyTorch and trained on 6 GeForce RTX 3090 systems for about 150,000 steps, which took approximately 63 h.

Computational evaluation and structure-based virtual screening for generated sequences
Three criteria were used to screen the generated sequences in silico to improved experimental validation success rates: the computational scores of composite metrics for assessing the quality of generated sequences, the 3dimensional pocket con straints of the 5 founder residues, and the robustness of the apigenin binding modes.Details are as follows.
We used random protein sequences of length 560 with the 5 founder residues as the starting sequence for the diffusion model sample.In the reverse diffusion process, we perform 600 steps of denoising the 60,000 starting sequences to obtain 60,000 generated sequences.In order to increase the likelihood that the generated sequences would function as F6H, we evalu ated the generated protein sequences using a variety of com putational metrics, including esm1v [88], Alphafold2 [18], ProteinMPNN [89], and others [90].Firstly, the 60,000 gener ated sequences were screened by the sequence motif con structed by the 5 founder residues, and 77 sequences were filtered out.Secondly, both the 77 generated sequences and the F6H sequences were scored for esm1v, and then the top 33 sequences in the esm1v results were selected for alphafold2 structure modeling.Thirdly, the constructed structures and sequences were evaluated using ProteinMPNN, and the top 19 designs were selected based on their ProteinMPNN scores, which were higher than that of CYP706X1 (−1.63).Fourthly, substrate apigenin and CpdI were docked into constructed structures using RosettaLigand and the substratebinding mod els were obtained based on binding affinity (interface_delta_X).Subsequently, MD simulations were performed to evaluate the overall structure stability and binding pocket stability for each designed sequences.Finally, 17 substratebinding structures that meet the catalytic pocket constraints constituted by founder residues and maintain stable substrate binding modes were chosen as candidate sequences for experimental verifica tion (Fig. S25).

Fig. 1 .
Fig. 1.De novo innovation of F6H function in CYP706X subfamily.(A) Phylogenetic relationship of 5 characterized genes in the CYP706 family.The CYP73A1 was set as an out-group.The maximum likelihood tree was constructed and all nodes received bootstrap support values from 100 replicates.(B) Phylogenetic tree of CYP706X and CYP706Y subfamilies.The inferred ancestral nodes are annotated with bold representations.CYP706X1 referred to F6H in E. breviscapus.(C) HPLC analysis of the fermented products of ancXY (the ancestor of CYP706X and CYP706Y subfamilies), ancY (the ancestor of CYP706Y subfamily), ancX (the ancestor of CYP706X subfamily), and CYP706X1.(D and E) Substrate-binding models of apigenin in the catalytic pocket of ancX (D) and ancXY (E).The dashed lines represented the hydrogen bond interactions.

Fig. 3 .
Fig. 3. Contribution of 5 founder residues for forming the reactive near-attack conformation.(A) Spatial conformation of 5 founder residues (cyan), substrate apigenin (green), and CpdI (light blue).(B to F) Comparison of each founder residue interacting with substrate in ancX and ancXY.The substrate apigenin in ancX and ancXY is in green and white, respectively.The founder residue in ancX and ancXY is in cyan and white, respectively.

Fig. 4 .
Fig. 4. P450Diffusion de novo design new P450 processes from scratch.(A) The design process for the new P450 includes P450Diffusion model construction (a pre-trained model and a fine-tuning model), sequence generation, and screening and experimental verification.The generated sequences were screened and evaluated to obtain the candidate sequences for experimental verification.(B) t-SNE embedding of natural, pre-trained model, and fine-tuning model generated sequences.The protein sequence space was visualized by transforming a distance matrix derived from k-tuple measures of protein sequence alignment into a t-SNE embedding.Dot sizes represent the 50% identity cluster size for each representative.(C) The distribution of 5 founder residues and control residues (CPG) among natural and generated P450s is illustrated in the WebLogo using multiple sequence alignment (MSA) [91].This visualization incorporates data from 4 distinct sources: the P450Diffusion pre-trained model dataset (Nal P450), sequences generated by the P450Diffusion pre-trained model (Pre-trained), the P450Diffusion fine-tuning model dataset (Nal F6H), and sequences generated by the P450Diffusion fine-tuning model (Fine-tuning).

Fig. 5 .
Fig. 5. Experimental verification and structural insights of de novo generated P450s.(A) The product scutellarein peak area of 17 designs, compared with natural Cnan706X, Lsal706X, and CYP706X1.Different colors were assigned to different proteins.(B) The histogram displays the peak areas of products associated with functional designs, with CYP706X1 used as the control group.(C) The structural distribution of mutations within the generated designs, with non-functional Design91808 and functional Design4129 as examples, were contrasted with that observed in CYP706X1.Mutations were visualized as red and green spheres for comparative analysis.(D) The boxplot illustrates the substrate RMSD values across MD simulations, with active designs depicted in green and inactive designs in red.(E) The boxplot represents the RMSD values for the overall protein structure across MD simulations, with active designs shown in green and inactive designs shown in red.
and H. Chu conducted deep learning work.C.S. performed the crystallization of ancX3.Other authors contributed to collating experimental results.Z. Chang and J.C. revised the paper.H.J. conceived and directed the project.Competing interests: The authors declare that they have no competing interests.